Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis
Authors
Abstract
Word embedding has achieved great success in many natural language processing tasks. However, attempts to apply word embedding to the field of speech have produced few breakthroughs, because word vectors mainly carry semantic and syntactic information. Such high-level features are difficult to incorporate directly into speech-related tasks, in contrast to acoustic or phoneme-related features. In this paper, we investigate a phoneme embedding method that generates phoneme vectors carrying acoustic information for speech-related tasks. One-hot representations of phoneme labels are fed into an embedding layer to generate phoneme vectors, which are then passed through a bidirectional long short-term memory (BLSTM) recurrent neural network to predict acoustic features. The weights of the embedding layer are updated through backpropagation during training. Analyses indicate that phonemes with similar acoustic pronunciations are close to each other in cosine distance in the generated phoneme vector space and tend to fall into the same category after k-means clustering. We evaluate the phoneme embedding by applying the generated phoneme vectors to speech driven talking avatar synthesis. Experimental results indicate that adding phoneme vectors as features achieves a 10.2% relative improvement in the objective test.
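The pipeline the abstract describes (one-hot phoneme labels, a trainable embedding layer, a BLSTM that predicts acoustic features, with the embedding weights learned by backpropagation) can be sketched as follows. This is a minimal PyTorch sketch under assumed settings: the phoneme inventory size, embedding and hidden dimensions, and acoustic feature dimension are illustrative choices, and the class name PhonemeEmbeddingBLSTM is hypothetical, not from the paper.

```python
# Minimal sketch of the phoneme-embedding model described in the abstract.
# All dimensions below are assumptions for illustration, not values from the paper.
import torch
import torch.nn as nn

class PhonemeEmbeddingBLSTM(nn.Module):
    def __init__(self, num_phonemes=40, embed_dim=32, hidden_dim=128, acoustic_dim=25):
        super().__init__()
        # nn.Embedding is equivalent to multiplying a one-hot phoneme vector by a
        # weight matrix; its weights are updated by backpropagation with the BLSTM.
        self.embedding = nn.Embedding(num_phonemes, embed_dim)
        self.blstm = nn.LSTM(embed_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, acoustic_dim)

    def forward(self, phoneme_ids):        # (batch, time) integer phoneme labels
        x = self.embedding(phoneme_ids)    # (batch, time, embed_dim) phoneme vectors
        h, _ = self.blstm(x)               # (batch, time, 2 * hidden_dim)
        return self.out(h)                 # predicted acoustic features

model = PhonemeEmbeddingBLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# One illustrative training step on random stand-in data.
phonemes = torch.randint(0, 40, (8, 100))   # batch of phoneme-label sequences
acoustics = torch.randn(8, 100, 25)         # frame-aligned acoustic targets
optimizer.zero_grad()
loss = criterion(model(phonemes), acoustics)
loss.backward()                             # gradients also flow into the embedding
optimizer.step()

# After training, the rows of model.embedding.weight are the learned phoneme vectors.
vectors = model.embedding.weight.detach()
cos = nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
```

Comparing rows of the learned embedding matrix by cosine similarity, or clustering them with k-means, reproduces the kind of analysis the abstract reports, in which acoustically similar phonemes end up close together.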
Similar Articles
A Talking Head System for Korean Text
A talking head system (THS) is presented to animate the face of a speaking 3D avatar so that it realistically pronounces the given Korean text. The proposed system consists of a SAPI-compliant text-to-speech (TTS) engine and an MPEG-4 compliant face animation generator. The input to the THS is Unicode text to be spoken with a synchronized lip shape. The TTS engine generates a phon...
Real-time Speech Driven Avatar with Constant Short Time Delay
It has been shown that the perception of speech is inherently multimodal [16][22]. Auditory-visual speech recognition is more accurate than auditory-only or visual-only speech recognition [1][10]. Research shows that a synthetic talking face can help people understand the associated speech in noisy environments [16]. It also helps people react more positively in interactive services [20]. In some sit...
Phoneme-level articulatory animation in pronunciation training
Speech visualization is extended to use animated talking heads for computer assisted pronunciation training. In this paper, we design a data-driven 3D talking head system for articulatory animations with synthesized articulator dynamics at the phoneme level. A database of AG500 EMA-recordings of three-dimensional articulatory movements is proposed to explore the distinctions of producing the so...
Photo-Realistic Talking-Heads from Image Samples
This paper describes a system for creating a photo-realistic model of the human head that can be animated and lip-synched from phonetic transcripts of text. Combined with a state-of-the-art text-to-speech synthesizer (TTS), it generates video animations of talking heads that closely resemble real people. To obtain a natural-looking head, we choose a “data-driven” approach. We record a talking...
Facial Expression Synthesis Based on Emotion Dimensions for Affective Talking Avatar
Facial expression is one of the most expressive ways for human beings to deliver their emotion, intention, and other nonverbal messages in face to face communications. In this chapter, a layered parametric framework is proposed to synthesize the emotional facial expressions for an MPEG4 compliant talking avatar based on the three dimensional PAD model, including pleasure-displeasure, arousal-no...
Publication date: 2016